Section I

Introduction

This report discusses progress on the development of a potentially low-cost,
highly integrated vocoder based on the Belgard algorithm. Work is reported in
two areas: the continuing development of the channel bank analyzer and
synthesizer integrated circuits, and the initial work to develop a pitch
tracking chip. These two areas of research are described in Section II and
Section III, respectively. Each section is, for the most part, self-contained,
but because Section II addresses only changes in the Belgard ICs, the final
report for DARPA Contract No. N00173-77-C-0100, published in January 1979, may
be referred to for a complete picture of the present design.




Section  II 

Channel  Bank  Vocoder  Integrated  Circuit  Development 

This section describes the details of the first IC redesign, the subsequent
evaluation, and the second redesign, just completed at this writing. The
descriptions of the analyzer and synthesizer are separate. In each, the history
of  one  circuit  block  at  a  time  is  discussed  chronologically,  beginning  with  a 
summary  of  the  problems  of  its  initial  design.  This  section  is  not  meant  to 
be  a  complete  IC  documentation.  Detail  is  given  relating  only  to  changes; 
circuit  blocks  with  acceptable  performance  on  the  initial  design  are  not 
discussed  here.  Complete  documentation  will  be  presented  in  a  future  report 
when  test  results  of  the  final  designs  have  been  obtained. 

A. Channel Bank Analyzer

All the problems with the bandpass filters in the initial design were
associated with the output amplifier (differential charge amplifier, DCI).
The DCI had poor common mode rejection that was very sensitive to a required
external bias voltage. In addition, the switched capacitor feedback clocks
were phased improperly, resulting in a reduced voltage on the CCD sense gates
and, hence, a reduced CCD signal capacity. The filters were sufficiently
functional to determine that the center frequencies and bandwidths were
acceptable.

The  filter  redesign  consisted  of  replacing  the  DCI  with  another  differential 
amplifier  circuit  that  required  no  external  bias  and  correcting  the  clock  phase 
in  the  feedback  loop.  These  redesigned  filters  performed  as  predicted,  and  no 
further modifications are required. Schematics of the DCI circuit and the
feedback clock phase are shown in Figures 1 and 2, respectively.


A simplified schematic of the half-wave rectifier as originally designed
is shown in Figure 3. During R1, current flows through the rectifying transistor
MR, while the rectifier input is set to [value illegible in the source]. At the
end of R1 the feedback loop should stabilize with MR just at threshold. Any
increase of the gate voltage of MR during S1 will raise node C, whereas a
decrease at B leaves C unaffected.

There were two major flaws in the original design. One flaw was that R1
pulled too much current through MR. This resulted in an unstable feedback loop.
MR did not settle to its threshold condition after reset, but was turned off
by ~100 mV instead. This problem prevented input signals with amplitudes
less than 100 mV from being rectified, but this effect was masked by the other
design flaw. The second problem was that, because the R1 transistor was connected
to a low impedance source and the S1 transistor was not, the clock feedthroughs
to node A differed. When R1 turned off, node A did not couple down as much as it
coupled up when S1 turned on. The rectifier interpreted this as signal and
rectified it. As a result, the rectifier output followed any signal greater
than ~400 mV.

A schematic of the redesigned rectifier is shown in Figure 4. The extra
bias supply was replaced by a current-limiting circuit, MD and ME. These
transistors were designed to carry 0.5 µA during R1. However, MD did not behave
as predicted, resulting in only 0.2 µA. This current was insufficient to slew
node C back to threshold after rectifying large signals. Thus, the rectifier
was saturated with 2 V input signals rather than the 5 V design goal. This
problem will be overcome in the next version by modifying MD and by connecting
its gate to a bond pad so that an external bias can be applied if necessary.

The input structure was modified to equalize the clock coupling from R1
and S1. The R1 transistor was no longer connected to a low impedance source.
However, there was still a problem due to clock timing. Node A was coupling
down before the feedback loop completely settled; consequently, part of the
clock coupling was passed to node C, and when S1 turned on, MR began conducting
before node A reached [value illegible in the source]. To correct this problem,
R1 will be replaced by R2 in the next version. This will allow the feedback
loop to settle before node A couples down.
The schematic of the lowpass filter as it existed in both the original design
and the redesign is shown in Figure 5. Switches S1 and L1 operate a single-pole
section while S3 and L2 control a second-order section. Transistor pairs
(M21 and M22) and (M23 and M24) are source-follower buffer amplifiers that
generate offsets (differences between input and output levels). The scheme
originally proposed to eliminate these offsets required clocks R3 and S2.
When the reference level is on the lowpass input, S2 turns off and R3 turns on.
This passes the reference level through each of the buffers, accumulating the
offsets. Clock R4 turns on, storing the offset on the 5.63 pF coupling
capacitor. Then R3 turns off and S2 turns on. The problem with the performance
of the design is that an offset is generated by the clocks as well as by the
buffer amplifiers. In the original design the transistors were unnecessarily
large. The redesign used minimum geometry transistors with shield gates to
reduce the gate-to-source capacitance and, hence, the clock coupling. The
offset was reduced in the redesign, but because the redesigned analog
multiplexer had additional gain, the net effect of the offset was worse in the
redesign.

The next version has the modified topology shown in Figure 6. There will
now be two lowpass filters in each channel. One will filter the "zero signal"
level, and the other will filter the signal. They share buffer amplifiers and
are located in very close proximity, so their offsets should be very well
matched. The offsets should cancel when the correlated sampling circuit at the
output subtracts "zero signal" from the signal filter output.

The price paid for this dual filter approach is twofold. First, the dual
filter requires nearly twice as much area as the original design. The other,
equally severe, cost is the increased clocking complexity. The clock timing
diagram is shown in Figure 7. Twelve clocks are required, in contrast to the
six clocks needed in the original design.


Figure  7  Clock  Timing  in  Most  Recent  Design 


A schematic diagram of the original analog multiplexer design is shown in
Figure 8(a). During the lowpass reference period, R4 turned on to store the
offset on the coupling capacitor. A 50 Hz externally supplied synchronization
signal generated clock S50 during the signal period of the lowpass. This stored
the lowpass output (divided down by the capacitive divider formed by the
coupling and storage capacitors) on the storage capacitor. Each of the 19
storage capacitors was then separately selected by its own select signal for
amplification and analog-to-digital conversion. Prior to sampling another
channel, the charge associated with the previous channel was eliminated with
the Ra/o clock.


Several problems were associated with this initial design. Because operational
amplifier offsets are unpredictable, the original plan called for a separate,
externally adjustable bias supply, MUX REF. This supply was adjusted such that
the "zero signal" stored on the storage capacitor corresponded to the offset of
the amplifier. While this scheme was workable, the separate supply was
cumbersome and, in principle, unnecessary. In Figure 8(b) the redesigned
multiplexer schematic shows that the lowpass signal with respect to the
amplifier offset, not ground, was stored on the storage capacitor. This
redesigned topology eliminates the extra supply requirement, but inverts the
signal. The reinversion was accomplished with a second amplifier that was
required to provide additional gain as well. The redesigned multiplexer had
26 dB gain, compared to 13 dB in the original.


Another problem was caused by the asynchronous timing of the S50 clock. In
order to minimize the delay time between the external synchronization pulse and
the ADC output, the clocks controlling the output were synchronized to the first
complete 10 kHz cycle on the chip. This meant that the S50 pulse could occur in
any one of ten time positions relative to the 1 kHz clocks. On that one-in-ten
occasion when S50 coincided with the S3 clock, the outputs of all channels were
offset. The redesigned circuit had timing modifications to prevent S50 from
coinciding with S3.


Figure 8(a) Original Multiplexer Design
The only other difficulty associated with the multiplexer was its
susceptibility to leakage current. The amplifier inverting input node is
connected to a bus line to all 19 channels. The bus had a junction area of
nearly 30 mil², which produced leakage current discharging the input node and
adding a ramp to the signal. The redesign reduced the junction area of the bus
to approximately one-quarter of the original area.

The  results  of  the  redesign  were  poor.  The  increased  gain  exacerbated  each 
of  the  problems.  The  multiplexer  was  13  dB  more  sensitive  to  leakage  current, 
and  since  the  redesign  did  not  reduce  the  junction  area  by  that  amount,  the 
problem  grew  worse.  The  asynchronous,  S50-caused  offset  was  also  still  present 
and  much  larger.  In  addition,  elimination  of  the  separate  power  supply  removed 
the  one  adjustment  that  could  have  made  the  redesign  functional.  Not  only  can 
the  extra  supply  compensate  for  the  multiplexer  offset,  but  it  can  also  compensate 
for  the  average  lowpass  offset  as  well.  With  the  additional  gain  the  lowpass 
offsets  were  sufficient  to  saturate  the  amplifiers.  In  addition,  the  reset  noise 
of  the  first  amplifier  was  amplified  in  the  redesign  to  an  unacceptable  level, 
a  problem  not  encountered  in  the  original  design. 
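
As a rough check of the leakage argument, the gain and area figures quoted above can be combined. This is only a sketch: it assumes that the leakage-induced ramp scales linearly with the bus junction area and that the 13 dB of added gain (26 dB versus 13 dB) applies to it directly. Neither assumption comes from the measurements.

```latex
\Delta_{\text{leak}}
  \approx \underbrace{(26 - 13)\ \text{dB}}_{\text{added gain}}
  + \underbrace{20\log_{10}\!\left(\tfrac{1}{4}\right)\ \text{dB}}_{\text{area reduced to one-quarter}}
  \approx 13 - 12.04
  \approx +1\ \text{dB}
```

Under these assumptions the ramp seen at the output is about 1 dB larger than in the original design, which agrees in sign with the observation that the problem grew worse.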


A schematic diagram of the newest analog multiplexer is shown in Figure 9.
The circuit is very similar to the first redesign, with three major differences.
To avoid the asynchronous offset problem, the new design will always sample a
frame of speech at 1 kHz unless the external synchronization signal is received,
at which time frame sampling ceases until readout is complete in order to avoid
skewing the data. The other major differences are the symmetric, fully
differential configuration of the amplifiers and the sample-and-hold function
performed by [clock names illegible in the source] on the reference and on the
signal. Note also that the second gain stage is now reset by [clock name
illegible in the source] instead of biased with a switched capacitor feedback.
This ensures that the reset noise of the first amplifier will be stored on the
coupling capacitor and unobserved at the output. The reset noise of the second
amplifier still remains, but because its gain is significantly lower, its level
should be acceptably low.
Finally, in case all the efforts to remove offsets in the lowpass and
multiplexer circuits fail to solve the problems, a separate adjustable bias
supply is added that leaks a small amount of charge through a switched
capacitor to compensate the average offset.

B.  Channel  Bank  Synthesizer 

The original synthesizer IC was nonfunctional due to a layout error in the
pitch counter, which prevented the pitch word from being loaded into the counter.
Figure 10 contains a schematic diagram of the two MSB stages of the latest pitch
counter. In the original design the FRAME END signal occurred after P1 and
before P2. Thus, the P2 clock replaced the input data with the existing data
before the input could be stored during P1. The schematic in Figure 10 indicates
the clock timing on the revision. The performance of the revised pitch counter
was good except that clock feedthrough from P2 to the floating node (indicated
in Figure 10) caused spurious counter load pulses. Decreasing the amplitude of
P2 to 12 V reduced the feedthrough enough for the pitch counter to operate
correctly. In the more recent redesign the P2 clock was shielded from the
floating node by a sheet of polysilicon so that the full 15 V clock amplitudes
can be used.

There  were  two  problems  in  the  original  design  of  the  noise  generator.  One 
was  that  the  rise  time  of  the  excitation  pulse  was  too  long.  This  was  successfully 
corrected  on  the  revision  by  adjusting  the  sizes  of  the  pullup  transistors.  The 
present  driver  circuit  is  shown  in  Figure  11.  The  other  problem  in  the  design  was 
that  the  voiced/unvoiced  decision  inhibited  the  generation  of  positive  noise 
pulses,  but  not  negative  pulses.  The  revision,  shown  in  Figure  12,  worked  well. 

The  symbol  x  represents  pitch,  positive  noise,  or  negative  noise  excitation. 


The  one  DAC  layout  problem  resulting  in  5  V  clock  signals  in  the  capacitor 
array  was  corrected  in  the  first  revision.  No  further  modifications  were 
necessary. 


Figure 13 depicts the original design of the signal path of a single
channel through the bandpass filter. The DAC and unity gain buffers contribute
offsets. The modulator is designed to eliminate the total accumulated offset.
During R3, the R4 clock stores on the coupling capacitors the voltage
$V_0 - V_{Bias1}$, where $V_0$ represents the lowpass output during the
reference half-cycle. Then, when R3 turns off, the bandpass filter samples
$V_{Bias2}$. Note that the bandpass does not sample during R3. When an
excitation pulse occurs ($p$, $n_+$, or $n_-$), the lowpass output is $V_{LP}$,
so the voltage step sampled by the bandpass is

$$V_{LP} - (V_0 - V_{Bias1}) - V_{Bias2}.$$

Let $V_Z$ be the lowpass output corresponding to silence. The bias voltages are
externally adjusted such that

$$V_Z - V_0 = V_{Bias2} - V_{Bias1}.$$

This condition assures that the bandpass samples no excitation during silence,
despite the offsets in the signal path.


In the original design a large 50 Hz signal was present at the modulator
input when silence was requested. This spurious 50 Hz signal was reduced in
magnitude by approximately 26 dB by replacing the single R3 clock by four clocks
with staggered falling edges. The clocks were revised to turn off in progression,
beginning with the switch closest to the DAC and ending with the reset switch in
the second stage of the lowpass. The staggered falling edges were accomplished
with simple inverters, and a layout error in one inverter occurred in the
revision. (This layout error is corrected in the latest revision.) With the aid
of an external inverter the circuit was functional. The revised circuit had a
silence offset sufficiently small that it could be eliminated with the
two-bias-supply approach. The remaining problem in the revision was that the
silence offsets of the even channels were different from those of the odd
channels. This effect is understandable because the even and odd channels are
laid out as mirror images. Any misregistration of mask levels during fabrication
can cause opposite effects on opposite halves of the chip. In order to provide
offset cancellation for all channels simultaneously, another bias supply has
been added to the latest design.
The bandpass filters on the original design had some conceptual as well as
layout errors. Layout errors prevented channel 19 from functioning and reduced
the gain of channel 16. The filter topology is shown schematically in Figure
14. The capacitor controlling the circuit Q is sensitive to parallel parasitic
capacitance. The other problem with the original design was that the capacitor
ratios were determined using the S-plane to Z-plane mapping, S = Z - 1. Because
the filter sampling rate is not sufficiently high, this approximation is not
adequate. As a result, the center frequencies and bandwidths differed
significantly from those required.
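
The size of the error introduced by the S = Z - 1 approximation can be illustrated numerically. The sketch below is not the design procedure used for the chip. It reads the mapping as z = 1 + sT (with s normalized by the sampling period T) and compares the pole angle it produces with that of the exact mapping z = exp(sT). The function name, the example center frequency and bandwidth, and both sampling rates are assumptions chosen for illustration.

```python
import cmath
import math


def pole_center_frequencies(f0_hz, bw_hz, fs_hz):
    """Return (approximate, exact) center frequencies in Hz for one pole.

    The s-plane pole of a resonator with center f0 and -3 dB bandwidth BW is
    taken as s = -pi*BW + j*2*pi*f0. fs_hz is an assumed sampling rate.
    """
    t = 1.0 / fs_hz
    s_pole = complex(-math.pi * bw_hz, 2.0 * math.pi * f0_hz)

    z_approx = 1.0 + s_pole * t      # first-order mapping, S = Z - 1
    z_exact = cmath.exp(s_pole * t)  # exact mapping, z = exp(sT)

    def center(z):
        # The angle of a z-plane pole sets the frequency it resonates at.
        return cmath.phase(z) / (2.0 * math.pi * t)

    return center(z_approx), center(z_exact)


if __name__ == "__main__":
    for fs in (10_000.0, 100_000.0):
        approx, exact = pole_center_frequencies(1000.0, 50.0, fs)
        print(f"fs = {fs:8.0f} Hz: approx {approx:7.1f} Hz, exact {exact:7.1f} Hz")
```

With these assumed numbers, a 1,000 Hz pole lands roughly 100 Hz low at a 10 kHz sampling rate, and the discrepancy shrinks to a fraction of a hertz at 100 kHz, which is the sense in which the approximation needs a sufficiently high sampling rate.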

In the revised chip the layout errors were corrected, the capacitor ratios
were more accurately chosen, and the filter topology was slightly modified
to reduce the performance sensitivity to parasitic capacitance. The revised
bandpass is shown in Figure 15. Note the difference in the Q-controlling
capacitor configuration and the clock timing change in that same vicinity.

A small, but representative, set of performance data is summarized in Table 1.
The center frequencies are within 12 Hz of their design goals and the
bandwidths within 20 Hz. Notice that there is a systematic increase of bandwidth
with increasing frequency. It is thought that this effect is due to capacitor
matching errors, and the later revision has attempted to compensate for this
problem. The most serious problem in the revised circuit is the low dynamic
range, only 30 dB. In the newly revised version 12 dB improvement should be
obtained by using lower noise operational amplifiers, and an additional 6 dB
increase should be gained by increasing the filter gains at the center
frequencies by that amount.

The only other problem with the synthesizer occurred in the summing
amplifier. A schematic diagram of the revised circuit is contained in Figure 16.
It is a fully differential configuration to sum the even and odd channel signals
in opposite phase. However, there are 10 odd channels and only nine even ones.
The parasitic capacitance is different on the inverting and noninverting nodes
as well. As a result, the odd channels have 2 dB lower gain than the even
channels. In the latest design the summing amplifier configuration is symmetric.
There should be identical gains for all channels.
Table 1

Synthesizer Filter Performance Summary

Channel   f0 (Avg.)   σ(f0)   BW (Avg.)   σ(BW)
   1        236.2       2.8      46.3      0.16
   2        355.9       1.3      47.5      0.21
   3        472.5       0.7      48.2      0.14
   4        595.3      23.3      49.1      0.86
   5        711.0       2.9      50.2      0.49
   6        830.8       3.4      50.2      0.26
   7        990.1       5.6      52.2      0.11
   8       1140.7       4.2      52.6      0.31
   9         ...        ...       ...       ...
  10         ...        ...       ...       ...
  11         ...        ...       ...       ...
  12       1794.0      10.2      77.4       ...
  13         ...        ...       ...       ...

Entries marked "..." are missing from the source; a lone value of 2.0 survives
for channel 9, but its column cannot be determined.
Figure 16 Summing Output Amplifier

Section III

Low Cost Pitch Tracking Development


A.  Introduction 

A  schedule  of  the  major  tasks  to  be  performed  in  the  DARPA  low  cost  pitch 
tracking  program  is  given  in  Table  2.  The  following  tasks  have  been  completed 
at  this  time: 

•  The  data  base  has  been  collected,  edited,  and  digitized  onto  disc. 

•  Software  programs  have  been  written  to  downsample  the  data  from 
12.5  kHz  to  both  10  kHz  and  8  kHz.  A  software  program  has  been 
written  to  add  noise  to  the  data. 

•  A  description  of  the  baseline  algorithms  has  been  completed  and 
is  included  for  the  Gold-Rabiner,  harmonic,  and  correlation  pitch 
trackers.  A  more  complete  description  of  the  data  base  and  the 
baseline  algorithms  follows  in  the  next  two  sections. 

B.  Data  base 

The pitch data base has been collected, digitized, and edited onto disc. The
data set consists of 58 speakers, 32 male and 26 female, ranging in age from
six years to 87 years. Each subject read a series of 11 utterances of
approximately three seconds in duration. The last six sentences were constructed
to contain approximately 70 within-word phoneme transitions. A complete
description of the data base is given in Appendix A.

C.  Baseline  algorithms 

Descriptions of the Gold-Rabiner, harmonic, and correlation pitch trackers
are presented. These descriptions represent the baseline algorithms that will
provide the base for modifications to achieve an LSI implementation. In each
case, an attempt was made to implement the algorithm as presented in the
literature, with some modifications to aid in the software simulation.




•  Harmonic Pitch Tracker

The basic algorithm is given in the papers by S. Seneff and P. Bosshart [1, 2].
Modifications have been made in the method of obtaining the Fourier
transform of the data. The reasons for the modifications are to allow
a change in sample rate and to allow a greater range of pitch frequencies
to be considered.

Once  every  FP  milliseconds  a  frame  of  data  will  be  processed  in  the 
following  manner  (typical  value  for  FP  is  10  ms): 

A.  Preprocess  the  data. 

1.  Get a frame of data (WL milliseconds long; typical value
    for WL is 38 ms).

2.  Preemphasize the data. (Preemphasis constant = 0.8.)

3.  Calculate the squared energy in the frame of data as:

    $E = \sum_{i=1}^{nx} x_i^2$

    where nx = number of samples in frame of data and $x_i$ is the
    i-th sample. If the squared energy is less than a fixed
    threshold, set the frame of speech data to unvoiced (pitch = 0),
    and return to Step 1. A typical voicing threshold is 10,000.

4.  Hamming-window the frame of data using a WL ms window.

5.  Take the FFT of the frame of Hamming-windowed data.
    Use an N point transform where N is such that the frequency
    resolution, DF, is equal to 6.6 Hz, i.e.:

    N = 1/[(sample period)(DF)]

    For example, if DF = 6.6 Hz and sample period = 80 microseconds,
    then N = 1,894.

6.  Obtain  the  magnitude  of  the  spectrum  just  calculated. 
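
The preprocessing steps above can be sketched in a few lines. This is an illustrative outline rather than the simulation code used in the program. The function name is hypothetical, the preemphasis is assumed to be the first-order difference y[i] = x[i] - 0.8 x[i-1], the energy test is assumed to apply to the preemphasized frame, and a direct DFT stands in for the FFT so that the non-power-of-two length N needs no special handling.

```python
import cmath
import math


def preprocess_frame(frame, sample_period, df=6.6, preemph=0.8,
                     energy_threshold=10_000.0, f_max=1100.0):
    """Return (N, magnitude spectrum up to f_max), or None if unvoiced."""
    # Step 2: preemphasis (assumed form: first-order difference).
    emphasized = [frame[0]] + [frame[i] - preemph * frame[i - 1]
                               for i in range(1, len(frame))]

    # Step 3: squared energy and the voicing threshold.
    energy = sum(v * v for v in emphasized)
    if energy < energy_threshold:
        return None  # frame declared unvoiced (pitch = 0)

    # Step 4: Hamming window over the whole frame.
    n = len(emphasized)
    windowed = [v * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, v in enumerate(emphasized)]

    # Step 5: transform length N = 1 / (sample period * DF).
    n_fft = round(1.0 / (sample_period * df))

    # Steps 5 and 6: magnitude of the (implicitly zero-padded) N-point DFT,
    # evaluated only for the bins at or below f_max.
    top_bin = int(f_max * n_fft * sample_period)
    spectrum = []
    for k in range(top_bin + 1):
        w = -2j * math.pi * k / n_fft
        spectrum.append(abs(sum(v * cmath.exp(w * i)
                                for i, v in enumerate(windowed))))
    return n_fft, spectrum
```

For a 38 ms frame sampled every 80 microseconds (475 samples), the function reproduces N = 1,894 from the example in step 5, and each bin of the returned spectrum is about 6.6 Hz wide.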

B.  Determine  the  pitch  for  the  frame  of  data. 

1.  Find the peaks in the spectrum from F_min to F_max (typical
    values of F_min = 210 Hz; F_max = 1,100 Hz).

2.  Eliminate spurious peaks.

    a.  A peak is removed if it is within 6 samples (≈ 40 Hz)
        of a larger neighboring peak.

    b.  A peak that is more than 6 samples, but fewer than 10
        samples, from its nearest neighbor is removed if its
        amplitude is less than one-half the amplitude of its
        nearest neighbor.

3.  Rank-order the remaining peaks in descending order of
    magnitude.

4.  Iteratively generate a table of pitch values, with the pitch
    values in ascending order in each row.

    First iteration (first row) - enter single pitch value in
    table (distance between two largest peaks).

    Second iteration (second row) - third largest peak is added
    to the list of peaks under consideration - two new pitch
    estimates are added to the table, defined as the distances
    between the adjacent peaks. A score is computed for the
    maximum number of consecutive "equal" pitch estimates in the
    table, where "equal" is defined as being within two samples
    (≈ 14 Hz) of the succeeding entry in the table.

    Nth iteration - iterations continue until at least seven
    "equal" estimates are obtained. If there are fewer than
    seven "equal" estimates, the iterations continue until the
    size of the next available leftover peak is less than
    one-tenth the size of the largest peak, or until a maximum of
    seven peaks have been exhausted. If either of these conditions
    is met, iterations stop, even though an inadequate score has
    been accumulated. Choose the pitch estimate with the best
    score. In case of a tie, choose the larger pitch estimate.
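
The spurious-peak rules and the rank ordering can be sketched as follows. The function names are hypothetical, and two points that the description leaves open are settled by assumption here: "within 6 samples" is treated as inclusive, and every peak is judged against the original, unpruned list in a single pass.

```python
def prune_spurious_peaks(peaks, near=6, far=10):
    """Apply rules a and b to peaks given as (bin_index, amplitude) pairs.

    The list must be sorted by bin_index. Distances are in FFT samples (bins).
    """
    kept = []
    for i, (pos, amp) in enumerate(peaks):
        neighbors = [peaks[j] for j in (i - 1, i + 1) if 0 <= j < len(peaks)]
        if not neighbors:
            kept.append((pos, amp))
            continue

        # Rule a: drop a peak within `near` samples of a larger neighbor.
        if any(abs(pos - npos) <= near and namp > amp
               for npos, namp in neighbors):
            continue

        # Rule b: drop a peak whose nearest neighbor is more than `near` but
        # fewer than `far` samples away and at least twice as large.
        npos, namp = min(neighbors, key=lambda p: abs(p[0] - pos))
        if near < abs(pos - npos) < far and amp < 0.5 * namp:
            continue

        kept.append((pos, amp))
    return kept


def rank_peaks(peaks):
    """Order peaks by descending amplitude."""
    return sorted(peaks, key=lambda p: p[1], reverse=True)
```

For example, `prune_spurious_peaks([(30, 100), (34, 40), (50, 90), (58, 30), (80, 70)])` drops the peak at bin 34 under rule a and the peak at bin 58 under rule b.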

C.  Smoothing and voiced/unvoiced decision.

The above preprocessing and determination of pitch period is done
on a frame basis, to obtain a pitch estimate every FP milliseconds.
The result is an unsmoothed pitch contour. This pitch contour is
now smoothed and a voiced/unvoiced decision is made. This is
accomplished by passing the unsmoothed pitch contour through a
three-point median smoother, followed by a five-point median smoother.

1.  Three-point median smoother

    If none of the three points are "equal," then the frame of
    speech is unvoiced. Here, "equal" is defined as being within
    five samples of each other (≈ 33 Hz).

2.  Five-point median smoother

    This smoother uses as input the output of the three-point
    median smoother. If no more than two of the five input
    samples are "equal," the frame is unvoiced. Here, "equal"
    means within three samples of each other (≈ 20 Hz).

As  can  be  seen  by  the  above  procedure,  there  will  be  at  least  a 
three-frame  lag  in  outputting  the  pitch  value  for  the  frame  of  data. 
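
The two smoothers can be sketched as below. The contour is assumed to hold one pitch value per frame in FFT samples, with 0 marking an unvoiced frame. The three-point rule follows the description directly; the five-point rule is interpreted here as requiring at least three of the five samples to lie within the tolerance of the window median, which is one possible reading of "no more than two ... are equal." The function names are hypothetical.

```python
from itertools import combinations
from statistics import median


def median3(contour, tol=5):
    """Three-point median smoother; outputs 0 (unvoiced) if no pair agrees."""
    out = []
    for i in range(1, len(contour) - 1):
        window = contour[i - 1:i + 2]
        agree = any(abs(a - b) <= tol for a, b in combinations(window, 2))
        out.append(median(window) if agree else 0)
    return out


def median5(contour, tol=3):
    """Five-point median smoother (assumed reading of the voicing rule)."""
    out = []
    for i in range(2, len(contour) - 2):
        window = contour[i - 2:i + 3]
        mid = median(window)
        agreeing = sum(1 for v in window if abs(v - mid) <= tol)
        out.append(mid if agreeing >= 3 else 0)
    return out


def smooth(contour):
    """Cascade the two smoothers as in part C."""
    return median5(median3(contour))
```

Each output of `median3` needs one frame of look-ahead and each output of `median5` needs two more, which accounts for the three-frame lag noted above.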


•  Gold-Rabiner Pitch Tracker

The basic Gold-Rabiner algorithm is given in the book Theory and
Application of Digital Signal Processing, by Gold and Rabiner [3]. The
algorithm which has been implemented follows the outline given in the paper
by Marilyn Malpass, "The Gold-Rabiner Pitch Detector in a Real Time
Environment" [4], except that some of the thresholds have been made
functions of the sample period.

The Gold-Rabiner algorithm works on a frame-by-frame basis:

1.  Obtain a frame of speech (typically 10 milliseconds).

2.  Low-pass filter (LPF) the frame of speech (≈ 1,000 Hz);
    a three-pole Chebyshev filter was used.


Find  the  maximum  and  minimum  values  of  the  filtered  speech 
within  the  frame  and  check  the  difference  against  an  energy 
threshold  (s  50).  If  the  energy  is  low,  set  frame  to  unvoiced 
and  set  the  periods  and  number  of  samples  since  the  last 
successful  peak  to  the  intial  values  in  each  of  the  six 
channels  (Table  3).  If  the  energy  is  above  the  threshold, 
perform  the  peak  search. 

Search  the  frame  of  data  sample-by-sample  for  peaks.  When  a 
change  in  slope  occurs,  take  the  previous  sample  as  the  peak. 
If  it  is  a  negative  peak,  complement  the  value  (this  result 
may  be  negative  if  the  value  of  the  peak  is  positive).  Store 
the  peak  value  as  the  current  positive  or  negative  peak  and 
take  the  measurements  described  below.  After  each  sample, 
decrement  the  blanking  count  if  greater  than  zero,  increment 
the  number  of  samples  since  the  last  success,  and  update  the 
current  measurement  threshold  (threshold  ■  old  threshold 
times  decay  factor),  if  the  blanking  count  has  reached  zero. 

Do  this  for  each  of  the  six  channel  information  blocks,  and 
return  to  the  peak  search.  Do  this  for  each  of  the  channel 
information  blocks  that  Is  affected  by  the  peak  just  found, 
and  return  to  the  peak  search. 

Take  measurements  Ml,  M2,  M3  (positive  peak)  or  m4,  M5,  M6 
(negative  peak)  and  store  in  respective  channel  blocks;  Ml, 
m4:  peak  value  =  current  positive  or  negative  peak.  M2,  M5: 
peak-valley  =  current  positive  (negative)  peak  plus  previous 
negative  (positive)  peak.  M3,  M6:  peak-peak  *  current  peak 
peak  value  minus  previous  peak  value.  (See  Figure  17.) 

Check  each  of  the  three  measurements  as  follows; 

If the blanking count is not equal to zero or the measurement
is less than the threshold, call the measurement a failure and
proceed to the next measurement. If the measurement is a
success, store it as the new threshold, slide periods $P_A$ and
$P_B$ to periods $P_B$ and $P_C$, and store the number of samples
since the last success as period $P_A$. If the previous frame was
unvoiced, do not change $P_{AV}$; if the previous frame was voiced,
compute a new $P_{AV}$ = (old $P_{AV}$ + $P_A$)/2 and confine $P_{AV}$
to be between $P_{AV\,min}$ and $P_{AV\,max}$ ($P_{AV\,min}$ =
4 milliseconds and $P_{AV\,max}$ = 10 milliseconds). Compute the
blanking count = 0.4 ($P_{AV}$), store the appropriate decay factor,
and set the number of samples since the last success to zero.
[Here decay factor = exp(-0.695/$P_{AV}$).] Proceed to the next
measurement.

Previous Measurement = 0 (used for M1 and M4)
5 milliseconds
Blanking Count = 1 millisecond
Current Measurement Threshold = 0
Decay Factor = exp(-0.695/$P_{AV}$)

Figure 17 Measured Parameters of Filtered Speech
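The bookkeeping for one channel information block can be sketched as below. It is a hedged illustration under stated assumptions: all periods are held in samples with a 0.1 millisecond sampling period (so 4 and 10 milliseconds become 40 and 100), and the names `ChannelBlock`, `tick`, and `check` are hypothetical. Since 0.695 is close to ln 2, the decay factor makes the threshold fall to about half its value after $P_{AV}$ samples.

```python
import math
from dataclasses import dataclass

P_AV_MIN = 40    # 4 ms, assuming 0.1 ms per sample
P_AV_MAX = 100   # 10 ms, same assumption


@dataclass
class ChannelBlock:
    threshold: float = 0.0
    blanking: int = 0
    since_success: int = 0
    decay: float = 1.0
    p_a: int = 0
    p_b: int = 0
    p_c: int = 0


def tick(block):
    """Per-sample update between peaks (ordering is an assumption)."""
    block.since_success += 1
    if block.blanking > 0:
        block.blanking -= 1
    else:
        block.threshold *= block.decay   # threshold = old threshold * decay


def check(block, measurement, p_av, prev_frame_voiced):
    """Test one measurement; return (success, possibly updated p_av)."""
    if block.blanking != 0 or measurement < block.threshold:
        return False, p_av               # failure: nothing changes
    block.threshold = measurement
    block.p_c, block.p_b = block.p_b, block.p_a   # slide A -> B, B -> C
    block.p_a = block.since_success
    if prev_frame_voiced:
        p_av = (p_av + block.p_a) / 2
        p_av = min(max(p_av, P_AV_MIN), P_AV_MAX)
    block.blanking = int(0.4 * p_av)
    block.decay = math.exp(-0.695 / p_av)
    block.since_success = 0
    return True, p_av
```

A caller would run `tick` on every sample for each of the six blocks and `check` once per measurement when a peak is found.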

At  the  end  of  the  frame  of  speech  data,  form  a  table  of  36 
pitch  periods  by  storing  P^,  Pg,  P^.,  P^  +  Pg,  Pg  +  P^.,  and 
P^  +  Pj.  from  each  of  the  channel  information  blocks.  The  six 
pitch period candidates are the most recent periods, $P_A$,
from the six channels. The pitch period of each candidate
being tested determines the window of tolerance. This window
is a function of the sampling period (Table 4). A window has
four "panes" with associated biases. Each pitch period
candidate, $P_K$, is compared to all 36 values four times as
follows:

a. Clear pitch period score (PSCORE) for this candidate.

b. Clear score counter (SCORE).

c. Determine "pane" for pitch period candidate.

d. Compare pitch period candidate against all 36 values
in table; if $|P_K - P_n| \le \mathrm{Pane}_j$, increment SCORE,
n = 1, ..., 36.

e. Add bias for this window pane to SCORE.

f. Compute NEW SCORE = SCORE - THRESHOLD (THRESHOLD = 13)


Table 4

Windows Of Tolerance And Bias

                 Pitch Period Ranges (Milliseconds)
Panes        16-31     32-63     64-127    128-255

Window 1       1         2          4          8
Window 2       2         4          8         16
Window 3       3         6         12         24
Window 4       4         8         16         32


g. Compare magnitudes of NEW SCORE and PSCORE.

If |NEW SCORE| > |PSCORE|, replace PSCORE with NEW
SCORE.

h. Repeat steps b. through g. with remaining "panes."

i. Save PSCORE for this pitch period candidate.

j. Repeat steps a. through i. for each of the remaining
pitch period candidates.

7. Pick the winning pitch period from the six candidates by
choosing the highest score, PSCORE. If the winning score
is negative or if the winning pitch period is greater than
$P_{max}$, set the voiced/unvoiced indicator to unvoiced
($P_{max}$ = 25.5 milliseconds). If the winning pitch period is
accepted, set the voiced/unvoiced indicator to voiced.
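Steps a. through j. and the final selection can be sketched as follows. This is an illustration under assumptions, not the report's code: periods are taken to be in sample counts matching the ranges of Table 4, the bias values are not legible in the source and are therefore passed in by the caller, and the names `panes_for`, `pscore`, and `pick_pitch` are hypothetical.

```python
# Pane widths from Table 4, keyed by the upper end of each pitch period range.
WINDOWS = [
    (31, (1, 2, 3, 4)),
    (63, (2, 4, 6, 8)),
    (127, (4, 8, 12, 16)),
    (255, (8, 16, 24, 32)),
]
THRESHOLD = 13
P_MAX = 255  # 25.5 ms, assuming units of 0.1 ms


def panes_for(period):
    """Select the four pane widths for a candidate period."""
    for upper, panes in WINDOWS:
        if period <= upper:
            return panes
    return WINDOWS[-1][1]


def pscore(candidate, table, biases):
    """Score one candidate against the table of 36 periods."""
    best = 0                                              # step a
    for pane, bias in zip(panes_for(candidate), biases):
        score = sum(1 for p in table if abs(candidate - p) <= pane)  # b-d
        score += bias                                     # step e
        new_score = score - THRESHOLD                     # step f
        if abs(new_score) > abs(best):                    # step g
            best = new_score
    return best


def pick_pitch(candidates, table, biases):
    """Return (period, score, voiced) for the highest-scoring candidate."""
    scores = [pscore(c, table, biases) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    voiced = scores[best] >= 0 and candidates[best] <= P_MAX
    return candidates[best], scores[best], voiced
```

Because step g. compares magnitudes, a strongly negative NEW SCORE can replace a small positive PSCORE; the sketch keeps that behavior as written in the steps.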

Optimized  Correlation  Pitch  Tracker 

The  optimized  correlation  pitch  tracker  is  a  pitch  tracking  algorithm 
developed  at  Texas  Instruments  by  George  Doddington  and  Bruce  Secrest 
and  is  described  in  the  internal  report  "Optimized  Correlation  Pitch 
Tracker  for  Speech  Systems  Applications,"  dated  February  1979. 

The  algorithm  consists  of  three  basic  parts.  First  a  correlation 
technique  is  used  to  obtain  the  periodicities  of  the  speech.  Then 
dynamic  programming  techniques  are  used  to  preserve  continuity  of  the 
pitch  track;  finally,  pattern  matching  is  used  to  obtain  the  voiced/ 
unvoiced  decision.  Since  the  algorithm  has  not  appeared  yet  in  the 
published  literature,  the  internal  technical  report  is  included  as 
Appendix  B. 
Section IV
Summary 

There  is  a  high  confidence  level  that  the  recent  revision  of  the  Belgard 
chips  will  perform  adequately.  The  latest  designs  have  been  completed  and 
submitted  to  the  prototype  photomask  shop.  Both  sets  of  masks  should  be  available 
by  1  March  1980.  Processing  will  proceed  simultaneously  in  both  the  Advanced 
Frontend  Processing  Center  (AFPC)  and  the  Central  Research  Laboratories  (CRL). 

This  decision  was  made  so  that  a  processing  problem  in  either  facility  will  not 
delay  fabrication.  An  optimistic  prediction  of  turnaround  in  the  AFPC  is  five 
weeks.  In  CRL  it  is  a  little  longer.  Turnaround  in  either  facility  will  not 
exceed  seven  weeks  under  normal  circumstances.  This  means  that  devices  will  be 
under  test  sometime  in  the  first  half  of  April. 


The  pitch  tracking  baseline  algorithm  simulations  are  complete,  as  is  the 
data  base  to  be  used  in  performance  evaluation.  Work  is  now  being  concentrated 
on  the  evaluation  technique  itself.  As  the  redesign  of  the  channel  bank  chips 
is  now  complete,  the  iteration  of  pitch  tracking  algorithms,  trading  algorithm 
complexity  for  implementation  ease,  will  begin.  Four  algorithms  will  be  examined 
for  integrated  circuit  implementation  in  the  next  six  months,  with  one  design 
emerging  as  the  best  candidate  for  IC  implementation. 
Appendix A


DESCRIPTION FOR DATA BASE PITCH


TITLE: PITCH

DIRECTORY NAME: (SPCHP.PITCH)
COLLECTOR: BRUCE SECREST


ROOM DESCRIPTION: TRACOUSTICS PE-PauB DOUBLE WALLED SOUND BOOTH
MICROPHONE TYPE: ELECTRO-VOICE PEl^; DYNAMIC, CARDIOID


FILE FORMAT AND DECODING:

A) CHARS 1-3: SPEAKER'S INITIALS
B) CHAR 4: PHRASE NUMBER (1,2,...,9,A,B)
C) CHAR 5: SEX (M OR F)
D) CHAR 6: AGE GROUP (1 IS <13; 2 IS 13-20; 3 IS 21-39;
   4 IS 40-...; 5 IS ...)
E) CHAR 7: SAMPLE RATE (1 IS 12.5 KHZ, 2 IS 10 KHZ, 3 IS ... KHZ)
F) CHAR 8: CHANNEL TYPE (0 IS NO FILTER, 1 IS FILTER)
G) CHAR 9: NOISE (N IS NO NOISE, P IS FOR NOISE, N IS ... ON NOISE)
H) CHAR 10: ALWAYS A .
I) CHARS 11-13: FILE TYPE:
   1) DS1 IS DIGITIZED SPEECH
   2) CSy IS COMPRESSED SPEECH (y IS FRAME PERIOD)
   3) XYY IS PITCH TRACK (YY IS FRAME PERIOD AND X
      IS PITCH ALGORITHM:
      1) CORRELATION           4) HARMONIC
      2) ...-BIT CORRELATION   5) GOLD-RABINER
      3) 2-BIT CORRELATION     6) CEPSTRAL)
   4) WDB IS NOISE DESCRIPTION (W IS TYPE & DB IS S/N)


PART A:

SPKRS: 1-29, 42-52
DATE: LATE JULY THRU LATE AUGUST, 1978
LOCATION: ... TEXAS INSTRUMENTS, DALLAS, TEXAS
RECORDER TYPE: TEAC A-4010 (A OR SL), 1/4 TRACK, 7 1/2 IPS
DIGITIZED DIRECTLY USING: 980 A/DS: NO; VAX A/DS: NO

PART B:

SPKRS: 38
DATE: 1/19/79
LOCATION: ... SPEECH LAB, TEXAS INSTRUMENTS, DALLAS, TEXAS
RECORDER TYPE: ... SL), 1/4 TRACK, 7 1/2 IPS
DIGITIZED DIRECTLY USING: 980 A/DS: NO; VAX A/DS: NO

PART C:

SPKRS: 30-37, 39-41, 53-...
DATE: OCTOBER-DECEMBER, 1979
LOCATION: HILLCREST SPEECH LAB, TEXAS INSTRUMENTS, DALLAS, TX.
RECORDER TYPE: NO ANALOG TAPE RECORDINGS MADE
DIGITIZED DIRECTLY USING: 980 A/DS: NO; VAX A/DS: YES


TEXT FOR DATA BASE PITCH


MARY HAD A LITTLE LAMB WHOSE FLEECE WAS WHITE AS SNOW. (WD1)
VERY FEW ANGELS ARE ALWAYS WISE AND PURE. (WD2)
THE TROUBLE WITH SWIMMING IS THAT YOU CAN DROWN. (WD3)
NIXON WAS TAKEN TO MOSCOW BY KISSINGER'S AIDE. (WD4)
WHICH TEA PARTY DID BAKER GO TO? (WD5)
AN EXAMPLE OF ONE OF THE BOY'S PERSONAL POINTS IS THE THINNESS OF HIS HANDS. (WD6)
A GOOD ... IS ALWAYS PROVIDED THE STUDENT OF ... (WD7)
IMPORTANT QUESTIONS WERE DRAGGED FROM THE SUBJECT (WD8)
ALMOST EVERYTHING INVOLVED MAKING THE CHILD MIND. (WD9)
THE VIEW OF THE PRESENT WILL LARGELY BE REACHED IN THE FOLLOWING CENTURY (WD10)
THE WIFE'S FIGURE HAD ALREADY ADJUSTED BY ITSELF. (WD11)


SPEAKER DIRECTORY FOR DATA BASE PITCH


FILE FORMAT


SPEAKERS NAME


1.  RlDxm-^,  xx.OSl 

2,  CJCxm-ai  XX.DSl 
5.  BMHXF^I XX ,9S1 

«.  CwcxM^t XX ,nsi 
5,  JLSx*«ai  xx.OSl 
B,  T Jkxht* XX.OSt 
7.  RGLJi  MX1  XX.DSl 
«.  EFGXF^IXX.DSI 
R.  GROiM^t XX.OSl 

10,  JCLxmx, XX.OS1 

1 1 ,  Rhwxm^T  XX ,DS1 

12,  JLHX'^ai  xx.DSl 

14.  RNSxMxi XX.OSl 

ia,  kabxmxi xx.osi 

15.  REhxm^, XX.OSl 

16.  LFCXFui XX ,0S1 

17.  ALKXF/j,  xx.OSl 
IS,  DKdxPui XX.OSl 

19.  AOCxFt ixx.DSl 

20.  SROXFi 1 XX.OSl 

21,  MJOxFnXX.DSl 

22,  DROXM^ iXX.DSl 
OPTIMIZED CORRELATION PITCH TRACKER FOR SPEECH SYSTEMS APPLICATIONS


George  Doddington  and  Bruce  Secrest 


Introduction 

A  pitch  extraction  algorithm  is  described  which  utilizes  a  segment  of 
speech containing several pitch frames. The decision as to the pitch period
and  voicing  for  a  given  frame  within  the  segment  is  deferred  to  the  end  of 
the  segment.  This  helps  overcome  anomalies  in  the  vocal  cord  vibrations 
within  the  segment  and  also  makes  the  method  robust  for  speech  imbedded  in 
moderate  levels  of  noise. 

The algorithm consists of three basic parts. First, a correlation technique
is used to obtain the periodicities in the speech to be used as candidate
pitch  periods.  Next,  a  dynamic  programming  algorithm  using  these  candidate 
pitch  periods  is  used  to  preserve  the  continuity  of  the  pitch  track.  Then, 
as  a  last  step,  pattern  matching  with  the  correlation  values  from  the  optimal 
track  is  used  to  obtain  the  voiced/unvoiced  decision.  The  three  basic  steps 
of  the  pitch  extraction  algorithm  will  be  discussed  in  the  next  three  sections. 

Candidate  Pitch  Periods 

The  candidate  pitch  periods  for  each  pitch  frame  are  obtained  by  using  a 
normalized  correlation  technique.  Since  the  frame  of  data  to  be  analyzed  is 
imbedded in a segment of speech, a forward correlation into new speech is
accomplished as well as a reverse correlation into old speech. This allows better
candidate  pitch  periods  in  regions  of  transition  such  as  from  nasals  to  vowels. 


Given a frame of speech data consisting of N samples (typically 10 milliseconds),
the first M samples of this frame (typically 10 milliseconds), $W_f$, are used as a
sliding window for the forward correlation and the last M samples of the frame, $W_r$,
are used in the reverse correlation. The normalized cross-correlation, $R_f(K)$,
between the sliding window, $W_f$, and M speech samples beginning at the $K$th sample
is used for the forward correlation and is defined by equation (1), where $x(i)$ is
the value of the $i$th speech sample and $K_{min}$ (= 2 milliseconds) and $K_{max}$
(given in milliseconds) correspond to the minimum and maximum pitch periods to
be considered, respectively. Similarly, the normalized cross-correlation, $R_r(K)$,
between the sliding window, $W_r$, and M speech samples earlier in time starting at
the $(M - K)$th sample is used for the reverse correlation, as given by equation (2).

Note that $|R_f(K)| \le 1$ and $|R_r(K)| \le 1$ because of the normalization.
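A normalized cross-correlation of the kind described above can be sketched as follows. The exact indexing of equations (1) and (2) is not legible in this copy, so the form used here (inner product divided by the square root of the product of the two window energies) is an assumption chosen because it satisfies $|R| \le 1$; the function names are hypothetical.

```python
import math


def normalized_xcorr(x, start, m, lag):
    """Correlate x[start:start+m] with the m samples offset by `lag`.

    A positive lag looks forward into new speech, a negative lag looks
    backward into old speech. The result lies in [-1, 1].
    """
    other = start + lag
    if min(start, other) < 0 or max(start, other) + m > len(x):
        raise ValueError("window runs outside the signal")
    a = x[start:start + m]
    b = x[other:other + m]
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0 else 0.0


def forward_reverse(x, frame_start, n, m, k_min, k_max):
    """Return dicts R_f[K] and R_r[K] for K = k_min..k_max (in samples)."""
    lags = range(k_min, k_max + 1)
    r_f = {k: normalized_xcorr(x, frame_start, m, k) for k in lags}
    last = frame_start + n - m            # start of the last M samples
    r_r = {k: normalized_xcorr(x, last, m, -k) for k in lags}
    return r_f, r_r
```

For a perfectly periodic signal both functions reach 1 at the true period and at its multiples, which is why the later stages must guard against period doubling.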


Once the $R_f(K)$ and $R_r(K)$ have been computed from equations (1) and (2),
a set of candidate pitch periods, S, is obtained by picking those values of K
for which $R_f(K)$ and $R_r(K)$ attain a maximum or peak ($\hat{R}_f(K)$ and
$\hat{R}_r(K)$). These peaks must be such that $\hat{R}_f(K) \ge .5$ and also
$\hat{R}_f(K)$ must be at least 1.3 times the previous minimum or valley in the
function, with similar constraints on $\hat{R}_r(K)$.

The set of candidate pitch periods, S, is enlarged by adding the half
pitch period, K/2, for all $K \in S$ such that $K \ge K_H$, where $K_H$ is a
fixed value (= 8 milliseconds). Also added to the set S is the unvoiced
candidate or no pitch period. Thus, if no maximum or peak of either $R_f(K)$
or $R_r(K)$ satisfies the above constraints, the set S contains only the
unvoiced candidate.
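The candidate selection rules can be sketched as below. This is an illustration under assumptions: lags are integers in samples, the "previous valley" is taken to be the lowest value seen since the last peak, K/2 is rounded down, and the unvoiced candidate is represented by `None`. The names `local_peaks` and `candidate_set` are hypothetical.

```python
UNVOICED = None


def local_peaks(r, min_peak=0.5, ratio=1.3):
    """Return lags where r has a peak meeting both constraints."""
    lags = sorted(r)
    if len(lags) < 3:
        return []
    peaks = []
    valley = r[lags[0]]
    for prev, k, nxt in zip(lags, lags[1:], lags[2:]):
        valley = min(valley, r[k])
        if r[prev] < r[k] >= r[nxt]:          # local maximum at lag k
            if r[k] >= min_peak and r[k] >= ratio * valley:
                peaks.append(k)
            valley = r[k]                     # start looking for a new valley
    return peaks


def candidate_set(r_f, r_r, k_half):
    """Build S from forward and reverse correlations."""
    s = set(local_peaks(r_f)) | set(local_peaks(r_r))
    s |= {k // 2 for k in s if k >= k_half}   # add half pitch periods
    s.add(UNVOICED)                           # always include "no pitch"
    return s
```

When neither correlation function has a qualifying peak, the returned set holds only the unvoiced candidate, as the text requires.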


Optimal  Pitch  Track 

Given the set S of candidate pitch periods for each pitch frame in the segment
of speech, it is desired to extract a pitch period for each frame such that
the  pitch  track  is  continuous  across  the  entire  segment,  contains  the  pitch 
periods  with  the  higher  cross-correlation  values,  and  minimizes  pitch  period 
doubling.  A  dynamic  programming  algorithm  is  used  to  achieve  these  goals. 

The dynamic programming algorithm consists of T trajectories (T = 4) or
tracks through each pitch frame in the segment of speech. At each pitch frame,
i, the trajectory consists of a pitch period, $K_i^j$, the value of a cumulative
penalty, $P_i^j$, and a back pointer, $B_i^j$, to that trajectory in the previous
frame resulting in the minimum cumulative penalty.

To extend the trajectories to the current $(i+1)$st frame, each element,
$K_{i+1}$, of the set S of candidate pitch periods for the current frame is compared
with all T trajectories of the previous frame. This comparison consists of
assessing a penalty in going from the $i$th frame to the $(i+1)$th frame. The
cumulative penalty at the $(i+1)$th frame using the $j$th trajectory of the $i$th
frame is given by equation (3).


In equation (3), $P_{i+1}^{j}(K_{i+1})$ is the cumulative penalty for the $j$th
trajectory at the $(i+1)$th frame using the candidate pitch period $K_{i+1}$ from
set S at frame $(i+1)$; $P_i^j$ is the cumulative penalty for the $j$th trajectory
at the $i$th frame; and $E_i$ is the RMS energy in the sliding window, $W_f$, at the
$i$th frame. The pitch periods of the trajectories of the previous frame, $K_i^j$,
are also extended into the $(i+1)$th frame with a constant penalty being added:

$$P_{i+1}^{j}(K_i^j) = P_i^j(K_i^j) + 1 - .5 + (.003)(40)$$


At any frame, the set of cumulative penalties obtained by the method
described above is searched to find the T minimum cumulative penalties. These T
trajectories are then saved for that frame to be used in extending to the next
frame.
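One frame of trajectory extension can be sketched as follows. The report's equation (3) is not legible in this copy, so the transition penalty used here, (1 - correlation) + 0.003 times the change in pitch period, is only an assumption suggested by the constant carry-over penalty 1 - .5 + (.003)(40); it omits the RMS energy term. The names `Node`, `step_penalty`, and `extend` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

T = 4                                   # trajectories kept per frame
CARRY_PENALTY = 1 - 0.5 + 0.003 * 40    # constant penalty for carried periods


@dataclass
class Node:
    period: Optional[int]   # None stands for the unvoiced candidate
    penalty: float          # cumulative penalty
    back: Optional[int]     # index of the predecessor in the previous frame
    corr: float             # correlation value kept for the voicing stage


def step_penalty(prev_period, period, corr):
    """Assumed transition penalty; NOT the report's equation (3)."""
    jump = 0 if None in (prev_period, period) else abs(period - prev_period)
    return (1.0 - corr) + 0.003 * jump


def extend(prev_nodes, candidates):
    """Extend trajectories by one frame; candidates are (period, corr)."""
    new = []
    for period, corr in candidates:
        cost = lambda n: prev_nodes[n].penalty + step_penalty(
            prev_nodes[n].period, period, corr)
        j = min(range(len(prev_nodes)), key=cost)   # best predecessor
        new.append(Node(period, cost(j), j, corr))
    for j, p in enumerate(prev_nodes):              # carry old periods forward
        new.append(Node(p.period, p.penalty + CARRY_PENALTY, j, 0.5))
    new.sort(key=lambda n: n.penalty)
    return new[:T]                                  # keep the T cheapest
```

Keeping only T nodes per frame bounds the work per frame to T times the number of candidates, which is the property that makes the method attractive for hardware.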


Another  way  to  look  at  the  dynamic  programming  approach  used  in  the  algorithm 
is  to  say  that  in  order  to  maintain  pitch  track  continuity  across  the  speech  segment, 
several  frames  (at  least  four)  are  analyzed  before  deciding  upon  the  first  frame. 

At  each  frame,  every  pitch  candidate  is  compared  to  the  retained  pitch  candidates 
of  the  previous  frame  (only  four  pitch  candidates  are  retained  for  each  frame). 

Each  comparison  results  in  a  cumulative  penalty  and  there  will  be  a  smallest  penalty 
for  each  of  the  candidates  in  the  new  frame  corresponding  to  a  comparison  to  one 
of  the  retained  pitch  candidates  of  the  previous  frame.  In  addition,  each  pitch 
candidate  of  the  previous  frame  is  also  a  candidate  in  the  new  frame  with  a  fixed 
increase in cumulative penalty. When the lowest cumulative penalty has been
calculated  for  all  new  candidates,  the  four  with  the  lowest  cumulative  penalties 
are  retained,  along  with  their  cumulative  penalties,  correlation  peak  values  and 
back  pointers.  The  back  pointer  of  a  pitch  candidate  indicates  which  candidate  of 


the previous frame corresponds to its cumulative penalty. Likewise that candidate
in the previous frame has a back pointer identifying another candidate in the
frame before it, etc. Thus the back pointers define a trajectory which has the
associated cumulative penalty of the last analyzed frame. The cumulative penalty
at the $(i+1)$th frame of the trajectory is given by equation (3). After the
four candidates and associated parameters of a pitch frame have been obtained, that
trajectory with the lowest cumulative penalty, $P_i^j$, is selected as correct. It is
traced backward m frames (at least four) to find the pitch value, $K_{i-m}^j$,
identified as the pitch during the $(i-m)$th frame.

At  the  end  of  the  segment  of  speech,  the  T  trajectories  in  the  last 
frame  are  searched  for  the  minimum  cumulative  penalty.  The  trajectory 
described  by  the  backpointers  is  called  the  optimal  pitch  track  for  that  segment 
of  speech. 
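Tracing the optimal pitch track through the back pointers can be sketched as below. The data layout is an assumption made for illustration: each frame is a list of `(period, cumulative_penalty, back_pointer)` tuples, with the back pointer indexing into the previous frame's list and `None` in the first frame. The name `optimal_track` is hypothetical.

```python
def optimal_track(frames):
    """Return the pitch period chosen for every frame of a segment."""
    last = frames[-1]
    # Start from the trajectory with the minimum cumulative penalty.
    j = min(range(len(last)), key=lambda n: last[n][1])
    track = []
    for nodes in reversed(frames):
        period, _, back = nodes[j]
        track.append(period)
        j = back                  # follow the back pointer one frame earlier
    track.reverse()
    return track
```

The same loop, stopped after m steps instead of run to the first frame, gives the deferred decision for the $(i-m)$th frame described earlier.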

Voiced/Unvoiced  Decision 

Given  the  optimal  pitch  track  from  the  dynamic  programming  algorithm,  the 
correlation  values  of  the  pitch  periods  of  this  optimal  path  are  scanned  to  make 
a  voiced/unvoiced  decision  at  each  frame. 

The scanning patterns are meant to span L (= 4) frames of the segment of
speech, which corresponds to L time periods. The motivation for the scanning
patterns is that when the speech is voiced, the correlation values should be
high. Note that the correlation values for the optimal pitch track will vary
from .5 to 1.0. In determining changes from voiced to unvoiced speech, the
correlation values would be expected to decrease from a high value to a low value
in a few time frames, and vice versa for the unvoiced to voiced transition. With
this in mind, four-point scanning pattern vectors might look like (L = 4):

$P_{UV}$ = {.5, .5, .5, .5} (Unvoiced) (4)

$P_{V}$ = {.9, .9, .9, .9} (Voiced)

$P_{VUV}$ = {.8, .8, .8, .8} (Voiced to unvoiced transition)

$P_{UVV}$ (Unvoiced to voiced transition)

Four errors are determined for each frame of speech by centering the scanning
patterns on the second element of the vector and computing a squared error between
the scanning pattern and the correlation values of the optimal pitch track, i.e.:

$$e_i^{I} = \sum_{j=1}^{L} \left[ R_{i-2+j} - p_j^{I} \right]^2, \qquad I = UV,\ V,\ VUV,\ UVV$$

where $e_i^{I}$ is the scanning error for the $i$th frame for one of the four scanning
patterns: $P_{UV}$, $P_{V}$, $P_{VUV}$, $P_{UVV}$; $R_i$ is the correlation value for the
pitch period at the $i$th frame contained in the optimal pitch track; and $p_j^{I}$ is
the $j$th element of the $I$th scanning pattern.
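The scanning error can be sketched as follows. Only the unvoiced and voiced patterns are included because the transition pattern values are not fully legible in this copy; frames are indexed from zero here, and the name `scanning_error` is hypothetical.

```python
PATTERNS = {
    "UV": (0.5, 0.5, 0.5, 0.5),   # unvoiced
    "V": (0.9, 0.9, 0.9, 0.9),    # voiced
}


def scanning_error(corr, i, pattern):
    """Squared error with the pattern's second element centered on frame i.

    corr is the list of correlation values along the optimal pitch track.
    """
    first = i - 1                          # index i - 2 + j for j = 1
    if first < 0 or first + len(pattern) > len(corr):
        raise IndexError("pattern does not fit around frame %d" % i)
    return sum((corr[first + j] - p) ** 2 for j, p in enumerate(pattern))
```

A caller would evaluate this for each pattern at each frame and pass the results to the threshold comparison.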


The voiced/unvoiced decision is made by comparing the scanning errors
against fixed thresholds. If the sequence is voiced (unvoiced) at the $(i-1)$st
frame, then the $e_i^{VUV}$ ($e_i^{UVV}$) scanning error is compared with a fixed
threshold (= .4). If this error is less than the threshold, the $i$th frame is
changed to unvoiced (voiced). If this error is also larger than the threshold,
the decision is deferred. The above strategy is continued until a frame is
confirmed as being the same, i.e. voiced (unvoiced), and any intermediate frames
which were deferred are made voiced (unvoiced). However, if the voicing decision
has changed, i.e. to unvoiced (voiced), and there are intermediate frames which
are unresolved, then the scanning pattern errors $e^{VUV}$ ($e^{UVV}$) are searched
for their minimum value at these intermediate frames and the transition point is
set at this minimum point.


